Some Acoustic and Perceptual Correlates
of Speaker Identification


By 

CONRAD LOUIS LARIVIERE


A DISSERTATION PRESENTED TO THE GRADUATE COUNCIL OF
THE UNIVERSITY OF FLORIDA IN PARTIAL
FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF
DOCTOR OF PHILOSOPHY


UNIVERSITY OF FLORIDA
1971 


ACKNOWLEDGEMENTS

The author gratefully acknowledges the guidance and support
of Dr. Harry Hollien, not only through the course of this study,
but also throughout the writer's graduate career at the University
of Florida.

The author is pleased to acknowledge the constructive
comments of his supervisory committee, composed of Drs. Brandt,
Ramey, Jensen, Dew and Algeo.

Special thanks are due the author's wife, Penny, for her
unflagging love and encouragement and the author's son, Chris,
who could not care less about the initials after his father's name.

Finally, the author wishes to acknowledge his immense debt
to his parents, Roland and Bella, who thought mullers unfit places
for young men.




TABLE  OF  CONTENTS 


Page 

ACKNOWLEDGEMENTS    ii

LIST OF TABLES    iv

LIST OF FIGURES    v

ABSTRACT    vii

CHAPTER

I    INTRODUCTION    1

        Review of the Literature    5
        Statement of the Problem    5

II    PROCEDURE    9

        General Experimental Approaches    10
        Details of the Experimental Procedure    12

III    RESULTS AND DISCUSSION    27

        Sentences    27
        Vowels    30
        Consonants    39
        Utterance Intelligibility and Speaker Identification    44
        Sample Time Interval    49
        Acoustic Analyses    51
        Differences Among Listeners and Speakers    59

IV    SUMMARY AND CONCLUSIONS    72

APPENDIX A    78

APPENDIX B    80

APPENDIX C    82

APPENDIX D    84

APPENDIX E    88

BIBLIOGRAPHY    97

BIOGRAPHICAL SKETCH    100




LIST  OF  TABLES 

TABLE  Page 

I.    ANALYSIS OF VARIANCE SUMMARY FOR LISTENERS'
      RESPONSES TO SENTENCE STIMULI    28

II.   ANALYSIS OF VARIANCE SUMMARY FOR VOICED VOWEL STIMULI    33

III.  A POSTERIORI COMPARISONS AMONG VOICED VOWELS    34

IV.   ANALYSIS OF VARIANCE SUMMARY FOR WHISPERED VOWELS    36

V.    A POSTERIORI COMPARISONS AMONG WHISPERED VOWELS    37

VI.   AVERAGE FORMANT FREQUENCY AMPLITUDES FOR VOICED
      AND WHISPERED VOWELS (dB DOWN FROM F1 AMPLITUDE)    38

VII.  ANALYSIS OF VARIANCE SUMMARY FOR FILTERED VOWELS    40

VIII. ANALYSIS OF VARIANCE SUMMARY FOR CONSONANT STIMULI    42

IX.   A POSTERIORI COMPARISONS AMONG CONSONANTS    43

X.    ANALYSIS OF VARIANCE SUMMARY FOR MONOSYLLABLES    52

XI.   RANK ORDER CORRELATIONS BETWEEN ACTUAL AND
      EXPECTED CONFUSIONS AMONG SPEAKERS FOR VOICED
      VOWELS    54

XII.  RANK ORDER CORRELATIONS BETWEEN ACTUAL AND
      EXPECTED CONFUSIONS AMONG SPEAKERS FOR WHISPERED AND FILTERED
      VOWELS    55

XIII. RANK ORDER CORRELATIONS BETWEEN ACTUAL AND
      EXPECTED CONFUSIONS AMONG SPEAKERS FOR CONSONANTS    57

XIV.  RANK ORDER CORRELATIONS BETWEEN ACTUAL AND
      EXPECTED CONFUSIONS AMONG SPEAKERS FOR /va/    60

XV.   FUNDAMENTAL FREQUENCY AND FORMANT FREQUENCY
      MEASURES FOR THE VOICED AND WHISPERED VOWELS    84

XVI.  FUNDAMENTAL FREQUENCY, FORMANT FREQUENCY, AND
      FORMANT BANDWIDTHS FOR THE CONSONANTS    85

XVII. FUNDAMENTAL FREQUENCY AND FORMANT FREQUENCY
      MEASURES FOR THE MONOSYLLABLE, /va/    86




LIST OF FIGURES

FIGURE  Page

1   SCHEMATIC REPRESENTATION OF GENERAL PROCEDURE    15

2   THE STRUCTURAL SCHEME OF THE FACTORIAL DESIGN;
    DATA FOR THE CONSONANT STIMULI    23

3   LISTENER PERFORMANCE FOR SENTENCES PRODUCED BY
    EACH SPEAKER    29

4   OVERALL LISTENER PERFORMANCE FOR VOWEL STIMULI    31

5   OVERALL LISTENER PERFORMANCE FOR CONSONANT STIMULI    41

6   OVERALL INTELLIGIBILITY LEVELS OF THE UTTERANCES
    EMPLOYED IN SPEAKER IDENTITY TASKS    45

7   PROPORTION OF LISTENER RESPONSE TYPES FOR
    UTTERANCES WHERE INTELLIGIBILITY AND IDENTITY
    JUDGMENTS WERE MADE    47

8   OVERALL LISTENER PERFORMANCE FOR EQUIVALENT
    DURATIONS OF MONOSYLLABLES AND ISOLATED PHONEMES    50

9   PERFORMANCE BY LISTENER FOR VOICED, WHISPERED,
    AND FILTERED VOWELS    62

10  PERFORMANCE BY LISTENER FOR VOICED AND VOICELESS
    CONSONANTS    63

11  PERFORMANCE BY LISTENER FOR MONOSYLLABLES    64

12  PERFORMANCE BY SPEAKER FOR VOICED, WHISPERED,
    AND FILTERED VOWELS    66

13  FORMANT FREQUENCY VALUES BY SPEAKER FOR
    WHISPERED /i/ AND /u/    67

14  FORMANT FREQUENCY VALUES BY SPEAKER FOR
    WHISPERED /æ/ AND /a/    68

15  PERFORMANCE BY SPEAKER FOR CONSONANTAL STIMULI    69

16  PERFORMANCE BY SPEAKER FOR MONOSYLLABIC STIMULI    71

17  CONFUSIONS AMONG SPEAKERS FOR VOICED /i/    88

18  CONFUSIONS AMONG SPEAKERS FOR VOICED /u/    88

19  CONFUSIONS AMONG SPEAKERS FOR VOICED /æ/    89

20  CONFUSIONS AMONG SPEAKERS FOR VOICED /a/    89

21  CONFUSIONS AMONG SPEAKERS FOR WHISPERED /i/    90

22  CONFUSIONS AMONG SPEAKERS FOR WHISPERED /u/    90

23  CONFUSIONS AMONG SPEAKERS FOR WHISPERED /æ/    91

24  CONFUSIONS AMONG SPEAKERS FOR WHISPERED /a/    91

25  CONFUSIONS AMONG SPEAKERS FOR FILTERED /i/    92

26  CONFUSIONS AMONG SPEAKERS FOR FILTERED /u/    92

27  CONFUSIONS AMONG SPEAKERS FOR FILTERED /æ/    93

28  CONFUSIONS AMONG SPEAKERS FOR FILTERED /a/    93

29  CONFUSIONS AMONG SPEAKERS FOR /s/    94

30  CONFUSIONS AMONG SPEAKERS FOR /z/    94

31  CONFUSIONS AMONG SPEAKERS FOR /f/    95

32  CONFUSIONS AMONG SPEAKERS FOR /v/    95

33  CONFUSIONS AMONG SPEAKERS FOR /v+a/    96

34  CONFUSIONS AMONG SPEAKERS FOR /va/    96


Abstract  of  Dissertation  Presented  to  the  Graduate  Council  of 
the  University  of  Florida  in  Partial  Fulfillment  of  the  Requirements 
for  the  Degree  of  Doctor  of  Philosophy 

SOME ACOUSTIC AND PERCEPTUAL CORRELATES
OF SPEAKER IDENTIFICATION

By

Conrad Louis LaRiviere

Chairman: Harry Hollien
Major Department: Speech

An investigation was undertaken concerning the ability of
listeners to identify speakers solely on the basis of voice. The
purposes of this study were: (1) to establish the relative
contributions of source and vocal tract transfer characteristics to speaker
identification, (2) to determine whether or not speaker
identification was possible on the basis of isolated utterances of continuant
consonants, (3) to investigate the nature of the relation between
utterance intelligibility and speaker identification, and (4) to
determine whether sample duration was a variable in speaker
identification in absolute or relative terms.

The subjects for this study were eight male speakers and
twelve listeners. The listeners were exposed to the following
speaker utterances: two prose sentences; four vowels under three
conditions (voiced, whispered, and low-pass filtered); four consonants;
and two consonant-vowel monosyllables, one natural and one "synthetic."

The three vowel conditions were taken to simulate the presence
only of: (1) source information (filtered), (2) vocal tract
transfer information (whispered), or (3) both (voiced). The monosyllables
allowed for the evaluation of the role of duration in speaker
identification. The listeners' task was to identify the speaker they felt
produced each item. For the vowel and consonant stimuli, they were
also required to identify the utterance.

Acoustic analyses of the speakers' utterances were performed
and the parameters extracted were then correlated with the confusions
among speakers. The parameters extracted included fundamental
frequency, the first three formant frequencies, and formant bandwidths.

The results indicated that: (1) all stimuli yielded speaker
identification performance above chance levels, (2) sentences
yielded performance far above any other stimuli, (3) the
performances achieved for whispered vowels and filtered vowels were very
nearly equal and summed to the performance achieved for voiced
vowels, (4) voiced consonants yielded significantly higher
performances than voiceless consonants, (5) the natural monosyllable
resulted in higher performance than the "synthetic" monosyllable,
and (6) utterance intelligibility was neither a necessary nor
sufficient concomitant to speaker identification.

The major conclusions reached were that there seem to be no
acoustic invariants related to speaker identification, and that
speech intelligibility and speaker identification appear then to
be qualitatively different percepts. This indicates that an
adequate model for phoneme identification would not necessarily
constitute an adequate model for speaker identification, and vice versa.

The following conclusions also seemed tenable: (1) speaker
identification for vowels seems based on both fundamental frequency
and formant frequency information, and the influence of these
parameters is both equal and additive, (2) speaker identification is
possible on the basis of isolated continuant consonants; the acoustic
cues responsible for the identification of these consonants proved
to be elusive, and they deserve further investigation, (3) duration
is an important variable in speaker identification performance in
that it allows listeners to sample added information on the basis
of some integral measure of multi-phonemic utterances, and (4) the
very high performance yielded by the sentence stimuli points to the
possible importance of suprasegmental cues to speaker identification.


I

INTRODUCTION 

The  extraction  of  a  speaker's  identity  solely  from  his 
voice  is  a  familiar  perceptual  phenomenon  that  can  be  observed 
in  conversations,  cocktail  parties,  audio  portions  of  television 
broadcasts,  etc.     Indeed,  in  certain  circumstances,  speaker 
identification  from  voice  cues  is  crucial;  a  pilot,  for  example, 
must  be  able  to  isolate  and  attend  to  one  voice  amid  a  welter  of 
voices  if  he  is  to  guide  his  aircraft  appropriately;  an  individual 
will  not  reveal  confidences  over  a  telephone  unless  he  is  sure  of 
his  listener's  identity.     As  demonstrated  below,   there  is  ample 
experimental  evidence  to  support  such  anecdotal  observations. 

Superficially, the process of speaker identification
from voice may seem to be of little consequence, constituting
little more than an empirically established curiosity. On
closer inspection, however, one finds several compelling reasons
underlying the investigation of speaker identity from voice cues.
Recently, for example, there has been considerable interest
concerning speaker identification from spectrographic representations
of voice ('voiceprints'). The forensic implications of such a
process are obvious, but there has been some skepticism and
controversy regarding its validity. Bolt et al. (1970) point
out that (1) experimental studies of voice identification using
visual inspection of spectrograms have yielded false identification
rates ranging from zero to 63%, depending on the type of task set for
the observer and the latter's training, (2) as yet experimental
studies designed to assess the reliability of identification from
voiceprints under practical conditions have not been carried out,
and (3) no experimental studies have been attempted which deal
with speakers who are attempting to disguise their voices. Tosi
et al.'s (1971) recent work concerning speaker identification from
'voiceprints' seems to represent an attempt to resolve some of
these issues. Although their preliminary results are promising,
Tosi et al. (1971) note a need for a great deal more research in
this area.

One must also countenance the (for some) Orwellian notion
that man's voice will serve as his universal identifying credential
for financial, security or professional matters. For example,
Kersta (1971) has reported the successful use of machine speaker
recognition for security purposes. However, he worked with a small
(N = 16), closed, and cooperative speaker society, and warned that
extensions of this approach to larger societies will be very
difficult. Tosi et al. (1971) have also stressed the manifest
difficulties of using large and/or non-cooperative speaker
inventories.

Another motivation for investigating speaker identification
from voice revolves around possible implications for models of
speech perception; specifically, the relation between speaker
identification and speech intelligibility presently is unknown,
as it is largely uninvestigated. Most models of speech perception
or speech recognition posit a gross preliminary analysis of the
signal prior to entry into either a strategy section, which
converges on a solution through an iterative process, or directly
into a lexicon (of phonemes, or words, or phrases, etc.), where
the input signal is matched against the items stored there.
Preliminary analysis (e.g., Halle and Stevens, 1962) is usually
taken to consist of a bank of contiguous bandpass filters, whose
output somehow serves as a rough guide (in the frequency domain)
for lexical selection.

If the perception of an utterance is found to be a sufficient
and/or necessary concomitant to speaker identification, then it
would be reasonable to infer that an adequate model for speech
recognition would also be an adequate model for speaker
identification. If such a relation does not obtain, one is left with the
prospect that speaker identification is a qualitatively different
percept from speech recognition, and may be mediated by different
cues in the acoustic signal. It may be tenable, then, to ascribe
the speaker identification process only to the preliminary analysis
segment of speech recognition schemes.

The main difficulty in all these approaches is that there
exists no explicit and valid model for speaker identification.
For instance, no known speech perception model offers or
entertains an explanation of an observer's ability to identify speakers;
machine recognition schemes (such as Kersta's, 1971) rely largely
on selected time-frequency-amplitude measures, yet no systematic
attempt has been made to relate such measures to speaker
identification performance. It is not the intent of this study to propose
or establish such a model, for it seems premature to formulate
one at this time. Speech is first and foremost an acoustical
signal, and whatever information a listener can extract from
speech, including the speaker's identity, is coded completely
among the acoustic characteristics of the signal. Note that
only a portion of such characteristics is available in visual
representations of voice, and an observer who attempts to
identify a speaker from spectrograms may well be at a distinct
disadvantage compared with one who can avail himself of the original acoustic
signal. Yet there remains some ignorance concerning the acoustic
correlates to speaker identification. The literature (outlined
below) concerning speaker identification by aural approaches
has been mainly concerned with quantifying the extent of the
effect; there has yet to be a systematic series of experiments
attempting to relate speaker identification to the acoustic
parameters of the signal. Conceivably, such acoustic correlates
could play a key role in an eventual speaker identification model.
Thus the focus of the present study concerns the investigation
of some possible acoustical and perceptual correlates of speaker
identification from voice.


It should be noted at the outset that some liberties with
traditional terminology have been taken in the literature. The
terms 'speech' and 'voice' have been usually taken to refer to
different phenomena, the former dealing with linguistic events
(e.g., events relating to the phonemes of some natural language)
and the latter dealing with non-linguistic events (e.g., laryngeal
parameters). Although all of the known research concerning aural
speaker identification utilizes 'speech' cues (texts, sentences,
phonemes), the common appellation for such a paradigm is 'speaker
identification from voice.' One suspects that this is due
primarily to the clumsiness of 'speaker identification from speech';
in any event, the phenomenon of identifying a speaker solely from
his utterances will be universally referred to here, for
conformity, as speaker identification from voice.

Review of the Literature

Pollack et al. (1954) were among the first to empirically
establish that listeners could identify speakers on the basis of
voice alone. Their speakers read identical, unspecified texts
under two conditions: voiced and whispered. The independent
variables were duration and filtering. Results for the voiced
samples indicated that speaker identification scores were directly
proportional to duration up to 1.2 seconds, beyond which there
was no improvement in performance. Listener performance under
varying high- and low-pass filtering conditions suggested that
identification performance was not "critically dependent upon
any delicate balance of different frequency components in any
single portion of the speech spectrum." On the other hand, for
the whispered productions it was found that performance was
equivalent to the voiced condition if duration was increased by
a factor of three. The authors concluded that duration was a
significant variable in speaker identification performance, in
that it admitted larger samplings of a speaker's repertoire.

Compton (1963) used only sustained productions of the
vowel /i/, and varied duration and filtering conditions. He,
too, found that performance increased with duration only up to
durations of 1250 msec. Moreover, he found that filtering
frequencies above 1020 Hz substantially reduced speaker
identification performance and filtering frequencies below 1020 Hz
had no significant effect on performance. Compton concluded that
confusions among speakers were largely explained by similarities
of their fundamental frequencies, yet he reported no attempt to
account for the confusions on any other basis.

Bricker and Pruzanski (1966) used a variety of speech
materials as stimuli in a speaker identification task. Of
particular interest here are their results when the stimuli were
vowels (V) as opposed to consonant-vowel monosyllables (CV):
when the V's and CV's were presented to listeners at identical
durations (140 msec), they found that CV stimuli yielded
significantly better identification performance than the V stimuli.
They concluded that the number of phonemes in a speech sample
correlated more closely with identification performance than did
the sample's absolute duration. They also noted that confusions
among speakers were not independent of the vowel uttered -- i.e.,
the pattern of speaker confusions differed over vowels. Stevens
et al. (1968) also noted vowel effects; specifically, they found
that utterances containing a front vowel (/i/) yielded higher
identification scores than utterances containing a back vowel
(/a/). They speculated that this latter result may have been due
to the importance of the second formant as a cue to speaker
identification, since front vowels are characterized by a wide
frequency gap between the first and second formant and a high absolute
frequency location of the second formant.

Statement of the Problem

In the acoustic domain, the basic parameters of a speech
signal are time, frequency and amplitude. Given constant amplitude,
the variables of interest are time and frequency. Phonemic effects
have been noted in previous experimentation and will also be
considered here. To summarize the literature with respect to these
influences:

A. Duration. Some evidence (Compton, 1963) suggests that speaker
identification is a function largely of absolute duration.
Bricker and Pruzanski (1966), Pollack et al. (1954), and by
inference, Stevens et al. (1968) suggest that duration is
important only insofar as it allows listeners to sample a larger
repertoire of the speaker's phonemes. Perhaps a more cogent
approach to the role of duration in speaker identification is
to consider its contributions in terms of information -- i.e.,
increases in duration yield more information about the speaker.
The problem then becomes how and where the added information is
coded in the speech signal.

B. Frequency. Although Compton (1963) has attributed speaker
confusions to fundamental frequency, he reports no attempt to
explain the confusions among speakers on any other basis, e.g.,
formant frequencies. Indeed, the speculations of Stevens et
al. (1968) suggest that the second formant may deserve a closer
look as a possible source of speaker confusion, and further
research in this area seems indicated.

C. Phonemic effects. Bricker and Pruzanski (1966) have shown
that speaker confusions vary with the vowel uttered; Stevens
et al. (1968) noted that front vowels yield higher scores than
back vowels. There are no known investigations concerning speaker
identification from continuant consonants (although Schwartz,
1968, has successfully used continuant consonants as stimuli
in a sex identification experiment), or the effects of
consonant to vowel formant transitions, or the relation, if any,
between sample intelligibility and speaker identification.

A systematic attempt to resolve the inconsistencies and
gaps noted in this review may serve as a useful preliminary to
the eventual establishment of a model for speaker identification
and would seem to constitute a viable general problem for research.
Hence, the research proposed here addresses itself to four specific
problems for research:

A. To what extent do source characteristics (e.g., fundamental
frequency) and vocal tract transfer characteristics (e.g.,
formant center frequencies) contribute to speaker
identification and confusions among speakers?

B. Is speaker identification possible from isolated samples of
continuant consonants?

C. What is the relation between utterance intelligibility and
speaker identification?

D. Does the time interval of a sample (herein referred to as
duration) contribute to speaker identification from an absolute
standpoint, or is it significant only because it allows the
listener to sample more of the phonemes in a speaker's repertoire?
If the latter, is the extra information coded in target or
transitional phonemic cues?


II 


PROCEDURE 

Since the experimental methods employed in this study were
rather involved, this section begins with an overview of the general
experimental approach taken to each specific research problem.
This is followed by a description of the procedural
details employed in the actual experimentation -- e.g., subject
and utterance selection, utterance treatments, data reduction, etc.

The scope of the present experimentation was limited to
the identification of a closed set of speakers from auditory cues
alone. The speech samples addressed to the specific research
problems were restricted to 'simple' utterances -- vowels, consonants,
and consonant-vowel monosyllables -- for two reasons: (a) there
is ample empirical evidence that speaker identification can be
reliably accomplished from such samples, and (b) sample control
(for duration and amplitude) and sample analysis (for the extraction
of acoustic parameters) are much more easily achieved for these
utterances than for more complex materials (e.g., sentences). The
latter, however, contain speech characteristics, such as rate of
articulation, inflection, and dialect, which are not present in
the simple utterances used in this study. In order to obtain
some estimate of the contributions of such characteristics, two
prose sentences were also used in this study.

General Experimental Approaches

Concerning the first research question, the evidence
noted above (Compton, 1963) that sustained vowels yield speaker
identification information helps to narrow down the possible
acoustic cues which the listeners avail themselves of, for a
'steady state' vowel may be described in terms of its source
(fundamental frequency and spectral overtones) and vocal tract
transfer (formant structure) characteristics (Fant, 1960). The
question remains whether either of these characteristics alone is
sufficient for speaker identification, or whether their
contributions are additive. Although it is functionally impossible to
uncouple the source (larynx) and the vocal tract, an attempt was
made to simulate such an uncoupling. The speakers were asked to
produce isolated vowels, voiced and whispered. These were
presented to listeners under three conditions:

(a) voiced: contained both source and transfer characteristics.

(b) whispered: contained primarily transfer characteristics.

(c) voiced, low-pass filtered below the first formant (at 200 Hz):
primarily simulated source characteristics.
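
The intent of condition (c) can be illustrated with a small numerical sketch. The text specifies only the 200 Hz cutoff, not the filter that was used, so the first-order filter below is a stand-in chosen for brevity; the sampling rate, the 120 Hz "fundamental," and the 2500 Hz "formant-region" component are likewise hypothetical values picked for illustration, not measurements from this study.

```python
import math

def one_pole_lowpass(samples, cutoff_hz, sample_rate_hz):
    """First-order low-pass filter (a stand-in, not the study's filter)."""
    dt = 1.0 / sample_rate_hz
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    alpha = dt / (rc + dt)          # smoothing coefficient in (0, 1)
    out = []
    prev = 0.0
    for x in samples:
        prev = prev + alpha * (x - prev)
        out.append(prev)
    return out

def sine(freq_hz, n_samples, sample_rate_hz):
    """Unit-amplitude sine wave of the given frequency."""
    return [math.sin(2.0 * math.pi * freq_hz * i / sample_rate_hz)
            for i in range(n_samples)]

def rms(samples):
    """Root-mean-square level of a sample sequence."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))

def gain_at(freq_hz, cutoff_hz, sample_rate_hz=10000, n_samples=10000):
    """Ratio of output to input RMS level for a sine at freq_hz."""
    tone = sine(freq_hz, n_samples, sample_rate_hz)
    filtered = one_pole_lowpass(tone, cutoff_hz, sample_rate_hz)
    return rms(filtered) / rms(tone)

# Hypothetical components: one near a male fundamental, one in the
# formant region. Both values are assumptions for illustration only.
f0_gain = gain_at(120.0, cutoff_hz=200.0)
formant_gain = gain_at(2500.0, cutoff_hz=200.0)
print("gain at 120 Hz:", round(f0_gain, 3))
print("gain at 2500 Hz:", round(formant_gain, 3))
```

Even this gentle filter passes most of the low-frequency component while strongly attenuating the high-frequency one, which is the sense in which the filtered condition retains primarily source information. A steeper filter would sharpen the contrast.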

The second research problem was examined by having identity
judgments made of continuant consonant cognate pairs. Continuants
were selected because they could be produced and presented in
isolation and their durational characteristics could be controlled
without affecting their intelligibility. Cognate pairs were chosen
because they differ in the presence or absence of fundamental
frequency within each pair, so that the data collected here could
also be addressed to the first research problem, i.e., if speaker
identification is largely coded in fundamental frequency
information, one would expect significantly better speaker identification
scores for the voiced consonants than for the unvoiced consonants.

The third problem concerns possible relationships between
speech intelligibility and speaker identification. It was
investigated by having listeners make intelligibility judgments for the
vowel and consonant stimuli simultaneously with speaker identity
judgments. A limited inventory of vowels and consonants (four of
each) was produced by the speakers. The listeners were apprised
of the inventory and forced to make intelligibility judgments as
well as speaker identity judgments.

The  fourth  problem,  regarding  the  role  of  sample  time- 
interval,  was  investigated  by  comparing  the  listeners'  identifi- 
cation performance  for  equivalent  durations  of  isolated  phonemes 
vs  cousonant-vowel  monosyllables.     T\<io  types  of  consonant- vowel 
monosyllables  were  em.ployed.     One  was  a  "synthetic"  monosyllable, 
assembled  from  sustained  productions  of  a  continuant  consonant 
and  a  vowel  used  above.     The  other  was  a  naturally  produced  mono- 
syllable, containing  the  saine  two  phonem.es  as  the  "synthetic"  mono- 
syllable.    Note  that,  given  equivalent  time- intervals  (duration), 
one  obtains  a  progression  of: 

(a)  continuant  consonant:     one  phoneme. 

(b)  vowel:     one  phoneme. 

(c)  synthetic  monosyllable:     two  contiguous  phonemrs,  but 
lacks  consonant  to  vowel  formant  transitional  informa- 
tion. 

(d) natural monosyllable: two phonemes, intact transitional
information.

Such an ensemble provides an opportunity to evaluate the
informational aspects of duration in speaker identification. If duration
is important only in absolute terms, one expects no significant
difference in listener performance over these stimuli, since they
have equal durations. If increases in duration are consequential
because they allow listeners added information in terms of the
steady state portions of a greater number of phonemes, then there
would be no significant difference between the natural and
"synthetic" monosyllables, but the latter two would yield higher
scores than the continuant consonant or vowel stimulus alone.
Finally, if the import of duration is that it allows the sampling
of consonant to vowel transitional information, the natural
monosyllable should yield higher identification scores than all other
stimuli.
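The three accounts of duration predict different orderings of
identification scores over the four stimulus types. The following is a
minimal sketch of that logic and is not part of the original procedure:
the function name, the stimulus labels, and the fixed tolerance (a crude
stand-in for a test of significance) are illustrative assumptions.

```python
STIMULI = ("consonant", "vowel", "synthetic_cv", "natural_cv")


def consistent_accounts(scores, tolerance):
    """Return the accounts of duration whose predictions match `scores`.

    scores    -- dict mapping each name in STIMULI to a mean score (percent)
    tolerance -- scores closer than this are treated as "not different";
                 an assumption standing in for a real significance test
    """
    c, v, s, n = (scores[name] for name in STIMULI)

    def same(x, y):
        return abs(x - y) <= tolerance

    def higher(x, y):
        return x - y > tolerance

    matches = []
    # Absolute duration only: all four stimuli perform alike.
    if all(same(x, y) for x in (c, v, s, n) for y in (c, v, s, n)):
        matches.append("absolute_duration")
    # More steady-state phonemes: both monosyllables alike, both above
    # the isolated consonant and the isolated vowel.
    if same(s, n) and all(higher(x, y) for x in (s, n) for y in (c, v)):
        matches.append("more_steady_states")
    # Transitional information: natural monosyllable above everything else.
    if all(higher(n, y) for y in (c, v, s)):
        matches.append("transitions")
    return matches


# Hypothetical scores, invented purely for illustration.
example = {"consonant": 20, "vowel": 22, "synthetic_cv": 21, "natural_cv": 40}
print(consistent_accounts(example, tolerance=5))
```

With these invented scores only the transitional account is satisfied,
since the natural monosyllable alone stands above the other three.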

Details  of  the  Experimental  Procedure 

A. Subjects.

(1) Speakers. Eight male speakers were used in this study.
Selection criteria were: (a) age range twenty to thirty-five years,
(b) no known speech defects, (c) speakers of general American
English.

(2) Listeners. The observers were twelve individuals
free from any known hearing defects, who had been in routine
contact with each of the speakers for a period of at least six months.
It was reasonable to assume, then, that the listeners were familiar
with the speakers' voices and that practice sessions in the actual
experimentation would not be required.

B. Stimulus Materials.

Selection of vowels. Because of the observations of Stevens et al.
(1968) and Bricker and Pruzanski (1966) noted above, two front
vowels, /i/ and /æ/, and two back vowels, /u/ and /a/, were chosen
for speaker production. Note that within the categories front
vowel-back vowel, one of the selected vowels is a high vowel and
the other a low vowel. Peterson and Barney (1952) demonstrated
that, for any given speaker, fundamental frequency probably does
not vary significantly over these vowels, that front vowels are
characterized by a higher second formant than back vowels, and that
high vowels are characterized by a lower first formant frequency than
low vowels. Moreover, these particular vowels were chosen because
they occupy extreme positions in the traditional vowel diagram; hence
their formant frequency structures should show wide contrasts
over a range of speakers. Additionally, the use of four vowels,
as noted above, allows for incorporating a closed set of utterances
for the generation of intelligibility and confusion measures.
As outlined above, the talkers produced these vowels in two
conditions, voiced and whispered.

Selection of consonants. Four continuant consonants were
used: /f, v, s, z/. Four consonants were used because, again,
such a selection allowed for the generation of intelligibility
and confusion data. These particular consonants were chosen
because of their high frequency of occurrence in English and
because they are relatively easy to produce and identify in
isolation. Additionally, some data regarding their acoustic
characteristics are available in the literature (Flanagan, 1965;
Harris, 1958; Heinz and Stevens, 1961; Hughes and Halle, 1956).

Selection of consonant-vowel monosyllables. The talkers
produced the monosyllable /va/. This monosyllable was selected
because its two phonemes, /v/ and /a/, were also produced in
isolation by each speaker. From these isolated productions, a
synthetic monosyllable, /v + a/, was assembled by a procedure
detailed below in section C.

Overall, these stimulus materials sufficiently address
themselves to the specific questions for research outlined
above: the vowel and consonant conditions allow the evaluation
of the contributions of source and transfer characteristics, the
use of several vowels and consonants allows for the evaluation of
possible relations between utterance intelligibility and speaker
identification, and the monosyllables should yield some insight
into the informational role of sample time interval (duration).

C. Preparing Stimulus Materials.

Figure  1  is  a  schematic  representation  of  the  four  principal 
steps  employed  in  the  procedure  of  this  study. 

Recording conditions. Each talker was seated in a sound
treated room (IAC 1204-A) and positioned approximately six inches
from a dynamic microphone (Electro-Voice 664). All utterances
were recorded at 7.5 inches per second on a single-track tape
recorder (Magnecord PT6-1), located outside the room.

The speakers were asked to produce the following utterances:
(a) name, (b) the first two sentences of the Rainbow Passage
(Fairbanks, 1956), using natural rate and inflection, (c) five
three-second productions of each vowel, voiced, (d) five three-second
productions of each vowel, whispered, (e) five three-second
productions of each consonant, and (f) five productions of /va/,
in which they prolonged the final vowel for two seconds.

Figure 1: Schematic representation of general procedure.

Speakers were instructed to achieve a constant VU meter
deflection (-2) for all of their utterances. Their performance
was monitored by both speaker and experimenter.

At the time of the recordings, a black and white photograph
was taken of each speaker against a flat background with
a high quality 35 mm camera. Eight-by-ten inch enlargements of
these photographs were obtained for use in the listening sessions.

Utterance selection and treatment. The five productions
of each stimulus were evaluated by a panel of four experienced
listeners (phoneticians). The latter chose the most
representative production of each stimulus from the five original
productions made by the speakers. It was these "preferred" productions
which were treated for duration, filtered where appropriate, and
repeated and randomized for the actual experimental listening
tapes. Instructions to the evaluators are contained in Appendix
A.

One duration, 1250 milliseconds, was used for all stimulus
materials. This duration was selected on the basis of the
observations of Compton (1963) and Pollack et al. (1954) that durations
exceeding this had no effect on identification performance. The
selected duration was generated in the following way: the
"preferred" utterances served as input to an electronic switch
(Grason-Stadler), whose duty cycle (1250 msec.) was manually initiated
by the experimenter at the beginning of each utterance. Rise and
fall times were set at 25 milliseconds. This selection was based
on data from Bricker and Pruzanski (1966); they reported that
rise and fall times above 15 msec. did not introduce any
artifactual consonantal effects. The output of the electronic switch
was monitored by a cathode ray oscilloscope (Tektronix 56''0).
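The gating performed by the electronic switch can be approximated
digitally. The sketch below is an illustration under stated assumptions
and not a description of the hardware used: it assumes a sampled signal,
a linear ramp shape, and hypothetical names (`gate`, `sample_rate`).

```python
def gate(samples, sample_rate, duration_ms=1250.0, ramp_ms=25.0):
    """Cut `samples` to `duration_ms` and apply rise and fall ramps.

    The linear ramp is an assumption; the shape produced by the
    electronic switch is not specified in the text.
    """
    n_total = int(round(sample_rate * duration_ms / 1000.0))
    n_ramp = int(round(sample_rate * ramp_ms / 1000.0))
    if len(samples) < n_total:
        raise ValueError("utterance is shorter than the requested duration")
    if 2 * n_ramp > n_total:
        raise ValueError("ramps do not fit inside the gated excerpt")

    gated = list(samples[:n_total])
    for i in range(n_ramp):
        gain = i / n_ramp                  # 0 at the edge, rising toward 1
        gated[i] *= gain                   # rise at onset
        gated[n_total - 1 - i] *= gain     # fall at offset
    return gated


# A constant signal sampled at 10 kHz stands in for a sustained vowel.
excerpt = gate([1.0] * 20000, sample_rate=10000)
print(len(excerpt), excerpt[0], excerpt[125], excerpt[6000])
```

A gradual onset and offset of this kind avoids the abrupt edges that
could otherwise be heard as consonant-like artifacts.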

In order to construct the synthetic /v + a/ utterances, the
following procedure was used: 600 millisecond excerpts of a
sustained production of /v/ were obtained (as described above) and
recorded on one channel of a Sony 530 recorder; similarly, a 620
msec. excerpt of /a/ was recorded on the second channel; a tape
loop was constructed containing both of the phonemes thus recorded.
This tape loop was placed on PAMMS (a pause adjustment mechanism
and measurement system developed by Jensen et al., 1970), a device
which allows a variable, monitorable delay to be introduced between
the material on each channel. The delay was adjusted to
approximately 5 msec. and the output of PAMMS was then recorded on a
Sony 530 recorder at 7.5 i.p.s.

To generate the filtered vowels, the 1250 msec. excerpts of
the voiced vowels were low-pass filtered at 200 Hz. The frequency
response of the filter (Krohn-Hite 3100) showed an attenuation rate
of 23 dB/octave.
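The practical effect of this roll-off on formant information can be
estimated with a simple calculation. The sketch below assumes an
idealized straight-line filter skirt (0 dB up to the cutoff, then a
constant slope in dB per octave); the function name and the example
frequencies are illustrative, not measurements from this study.

```python
import math


def attenuation_db(freq_hz, cutoff_hz=200.0, slope_db_per_octave=23.0):
    """Attenuation of an idealized low-pass skirt at `freq_hz`."""
    if freq_hz <= cutoff_hz:
        return 0.0
    octaves_above_cutoff = math.log2(freq_hz / cutoff_hz)
    return slope_db_per_octave * octaves_above_cutoff


# Energy one and two octaves above the 200 Hz cutoff.
for freq in (400.0, 800.0):
    print(freq, attenuation_db(freq))
```

Under this idealization, energy at 400 Hz is reduced by 23 dB and
energy at 800 Hz by 46 dB, so little beyond the region of the
fundamental survives the filtering.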

Generating the experimental tapes. Six experimental tapes,
one for each stimulus category (i.e., sentences, voiced vowels,
whispered vowels, filtered vowels, consonants, and monosyllables),
were generated. In all cases, the interstimulus interval was four
seconds; each production was repeated five times and randomized
according to a random number table. Since the listener response
forms contained 25 items per page, a ten second interval followed
every 25th item on all tapes. It should be noted that there was
no direct correspondence between any experimental tape and the
specific problems for research. The grouping of utterance types
on these tapes was solely to facilitate the listeners' task and the
analysis of data.

Experimental Tape 1 consisted of the voiced vowels. The
eight speakers had produced four different vowels and, since
each stimulus was repeated five times, this tape consisted of
160 stimulus items. This construction also applies for
Experimental Tapes 2, 3, and 4, which contained, respectively, whispered
vowels, filtered vowels, and consonants. Playback time for Tapes
1 through 4 was approximately 15 minutes each.

The fifth experimental tape contained all of the
consonant-vowel monosyllables. There were two monosyllables for each
speaker, and each monosyllable was repeated five times, so that
Tape 5 consisted of 80 stimulus items. The playback time for
Tape 5 was approximately 7.5 minutes.
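The item counts and playback times quoted above follow from the tape
construction rules. The sketch below reproduces them under two
assumptions that the text does not settle: a seeded pseudo-random
shuffle stands in for the random number table, and the ten second
interval is taken to replace (rather than add to) the four second
interval after every 25th item. All names are hypothetical.

```python
import random


def build_tape(speakers, utterances, repetitions=5, seed=0):
    """Return a randomized list of (speaker, utterance) stimulus items."""
    items = [(s, u) for s in speakers for u in utterances] * repetitions
    random.Random(seed).shuffle(items)  # stand-in for a random number table
    return items


def playback_seconds(n_items, stimulus_s=1.25, isi_s=4.0,
                     page_size=25, page_pause_s=10.0):
    """Total running time: each item plus the interval that follows it."""
    total = 0.0
    for position in range(1, n_items + 1):
        pause = page_pause_s if position % page_size == 0 else isi_s
        total += stimulus_s + pause
    return total


tape_1 = build_tape(range(1, 9), ["i", "ae", "u", "a"])
print(len(tape_1), playback_seconds(len(tape_1)) / 60.0)   # Tapes 1 to 4
print(playback_seconds(80) / 60.0)                          # Tape 5
```

Under these assumptions the running times come to about 14.6 and 7.3
minutes, in line with the approximate figures given for the tapes.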

Additionally, a control tape was assembled consisting of
five randomized repetitions of each subject's sentence
productions. The data collected from this tape were used as a
performance baseline over speakers and listeners; i.e., the listeners
here could avail themselves of relatively long durations of natural
speech (including speaking rate and inflection, as well as phonemic
effects), so their performance here was used as a metric in
assessing their proficiency for all other stimuli.

D. Acoustical Analysis.

Since  one  of  the  objectives  of  this  study  is  to  assess  to 
what  extent  speaker  identification  and  confusions  among  speakers 
may  be  accounted  for  by  source  and  transfer  characteristics  of  the 
vocal  mechanism,  each  talker's  "preferred"  utterances  from  Section 
C,  above,  were  subjected  to  acoustical  analyses  for  the  extraction 
of  the  following  parameters:     fundamental  frequency,  formant  fre- 
quencies, and,   for  the  voiceless  consonants,  formant  bandwidths. 

Fundamental frequency was determined, for all voiced stimuli,
by oscillographic analysis (Honeywell Visicorder, operated at 15
inches per second). Formant frequencies were estimated from
wide-band spectrograms generated by a Kay Electric 6061 spectrographic
unit. The protocols suggested by Dew et al. (1969) for such
measurements were followed. Formant bandwidths, for the voiceless
fricatives, were estimated from narrow-band amplitude sections.
Bandwidths were measured at a point 3 dB down from the peak of each
formant (this was found to correspond to a distance of 5 mm on the
spectrographic paper).

E. Stimulus-Response Task.

Portraits of each speaker were attached to the wall of the
listening room (IAC 1204-A). Immediately below each portrait, the
initials of the individual portrayed were printed in large block
letters. Instructions given the listeners are shown in Appendix B.

For the control tape and Experimental Tape 5, the listeners
were presented with answer sheets denoting the stimulus number,
followed by the eight sets of identifying speaker initials. The
listeners' task was simply to circle the initials of the speaker
whom they felt produced that stimulus item.

For  Experimental  Tapes  1  through  4,  the  answer  sheet  con- 
sisted of  the  stimulus  number,  followed  by  the  catalogue  of  utter- 
ances in  each  tape,  and  the  speakers'   initials.     The  listener  was 
asked  to  circle  the  utterance  heard  and  circle  the  initials  of  the 
speaker  whom  they  felt  produced  it.     An  example  of  an  answer  sheet 
the  listeners  used  when  they  were  hearing  Experimental  Tape  4  is 
contained in Appendix C.

F. Playback Conditions.

All experimental tapes were played from an Ampex 351 tape
recorder, through one channel of a Marantz Model 7 pre-amplifier
and Marantz 8-B power amplifier. The loudspeaker, an acoustic
suspension device (AR-4), was located in a sound treated room
(IAC 1204-A).

Stimuli  were  presented  to  listeners  at  70  dB  SPL.  One 
or  two  observers  per  listening  session  were  used,  seated  equi- 
distantly  (approximately  three  feet)  from  the  loudspeaker.  A 
sound  level  meter  (General  Radio),  positioned  where  listeners  were 
to be seated, was used as a calibration device. Calibration was
accomplished  via  a  1000  Hz  tone  which  was  recorded  at  the  same  VU 
level  (-2)  as  the  speech  samples.     The  RMS  voltage  at  the  loud- 
speaker's input  corresponding  to  70  dB  SPL  in  the  room  was  noted 
on  a  vacuum  tube  voltmeter  (Ballantine  321-C),  and  the  latter  was 
monitored  by  the  experimenter  throughout  the  listening  sessions. 

All listeners heard the control tape first. The other
experimental tapes were presented in random order. All listening
sessions were conducted over a period of five days. All listeners
indicated their responses by circling the appropriate items with a
black ink pen. Listener response forms were graded by overlaying
them on a coded key, on which the correct responses were circled
in colored ink, each color denoting the particular utterance and
speaker involved in a given trial. All responses by a given listener
were then transferred to confusion matrices for each type of
utterance; for example, actual speaker vs perceived speaker for
whispered /i/, etc. The diagonals of these matrices represented
correct listener responses and were used in all analysis of variance
procedures.

Although this grading and response summarization procedure
was a rather laborious one, it should be noted that this study
involved 9120 separate listener responses. It is felt that the
procedure used here was not only time saving, in the long run, but
also tended to minimize the intrusion of experimenter error, since
the number of responses in any given matrix may be summed, after
the scoring procedure, and used as a check for errors (for example,
a given listener's responses for whispered /i/ must sum to 40, etc.).
As a further check, all responses were also graded by an independent
observer; the few discrepancies which were found were then resolved.
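The summing check described above is easy to express in code. The
following is a minimal sketch, assuming each trial is recorded as an
(actual speaker, perceived speaker) pair; the function names and the
example responses are hypothetical and are not data from this study.

```python
def confusion_matrix(trials, speakers):
    """Tabulate actual speaker (rows) against perceived speaker (columns)."""
    index = {speaker: i for i, speaker in enumerate(speakers)}
    matrix = [[0] * len(speakers) for _ in speakers]
    for actual, perceived in trials:
        matrix[index[actual]][index[perceived]] += 1
    return matrix


def correct_responses(matrix, repetitions=5):
    """Verify the row totals, then return the sum of the diagonal."""
    for row_number, row in enumerate(matrix):
        if sum(row) != repetitions:
            raise ValueError("row %d does not sum to %d" % (row_number, repetitions))
    return sum(matrix[i][i] for i in range(len(matrix)))


# Invented responses: each speaker is heard correctly four times and
# taken for speaker "A" once.
speakers = list("ABCDEFGH")
trials = [(s, s) for s in speakers for _ in range(4)] + [(s, "A") for s in speakers]
matrix = confusion_matrix(trials, speakers)
print(sum(map(sum, matrix)), correct_responses(matrix))
```

With eight speakers and five repetitions, every matrix for one listener
and one utterance must total 40 entries, which is the check cited in
the text.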

Although the possibility of experimenter error in a paper
and pencil response paradigm of the magnitude used here is always
high, it is felt that the technique employed tended to minimize this
unwanted source of variation to the point where it is not a
significant factor in the results presented in this investigation.

G. Data Reduction.

The paradigm used here is a randomized block factorial design,
representing a mixed model (utterances are considered fixed effects,
speakers and listeners are considered random effects). The
consonantal conditions, for example, involve the following factors:
eight levels of speakers and four levels of consonants. Each
listener is considered as a block, since each is exposed to all
stimuli.

In his discussion of randomized block designs, Kirk (1968)
notes  that  it  is  not  necessary,  for  mixed  models,  to  assume  that 
the  block  and  treatment  effects  are  additive  in  order  to  test  the 
treatment  effects.     Consequently,  block  (listener)  by  treatment 
(speaker  and  utterance)  interactions  were  not  considered  in  the 
statistical  analyses  employed  in  this  study. 

Factorial designs of this sort allow the analysis of two
or more variables, both in terms of their individual ('main')
effects and in terms of their interactions with one another. The
presence of significant differences among treatment means is
determined by an analysis of variance procedure which utilizes the F
distribution. This procedure yields information concerning factor
effects and interactions; when any of the fixed effects factors
showed significant effects, comparisons among treatment level means
were made in order to determine the contributions of the levels
within each factor to the test scores. The procedure employed here
involved a posteriori comparisons among means using Scheffe's
method (Hays, 1963).
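Scheffe's criterion for a single contrast among means can be sketched
as follows. This is the textbook form for equal group sizes and is
offered as an illustration only; the exact error term used for the
simple main effects in this study is not given in the text, and the
means, mean square, and critical F below are invented values.

```python
import math


def scheffe_test(means, weights, n_per_mean, ms_error, f_critical):
    """Evaluate one contrast among k means by Scheffe's method.

    weights    -- contrast coefficients, which must sum to zero
    f_critical -- tabled F(alpha; k - 1, df_error), supplied by the user
    Returns (is_significant, contrast_estimate, critical_difference).
    """
    if abs(sum(weights)) > 1e-9:
        raise ValueError("contrast weights must sum to zero")
    k = len(means)
    estimate = sum(w * m for w, m in zip(weights, means))
    std_error = math.sqrt(ms_error * sum(w * w for w in weights) / n_per_mean)
    critical = math.sqrt((k - 1) * f_critical) * std_error
    return abs(estimate) > critical, estimate, critical


# Hypothetical example: four utterance means, first versus second.
print(scheffe_test([3.0, 1.0, 2.0, 2.0], [1, -1, 0, 0],
                   n_per_mean=12, ms_error=0.5, f_critical=2.9))
```

Because the critical difference is scaled by (k - 1) times the critical
F, the method guards against inflated error rates when comparisons are
chosen after inspecting the data.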

Figure  2  shows  the  structural  design  for  the  consonantal 
stimuli. The designs for the vowels and monosyllables are similar.

Figure 2: The structural scheme of the factorial design; data
for the consonant stimuli.

Concerning the relation of speaker identification and
utterance intelligibility, it should be noted that, for all vowel and
consonant stimuli, four types of responses are possible; that is,
identification is either correct or incorrect and utterance
intelligibility is either correct or incorrect. An inspection of the
distribution of these responses was made in order to ascertain whether
or not utterance intelligibility is a necessary and/or sufficient
condition for speaker identification.
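The inspection of the four response types amounts to reading a two by
two table. The sketch below shows one way to tally it, assuming each
response is coded as a pair of booleans; the strict zero-count criteria
and all names are illustrative assumptions, and in practice a small
number of exceptions would presumably be tolerated.

```python
from collections import Counter


def tally(responses):
    """Count responses by (speaker_correct, utterance_correct)."""
    return Counter((bool(s), bool(u)) for s, u in responses)


def necessity_and_sufficiency(counts):
    """Apply strict logical criteria to the two by two tally.

    Necessary:  the speaker is never identified when the utterance
                is misheard.
    Sufficient: the speaker is always identified when the utterance
                is heard correctly.
    """
    necessary = counts[(True, False)] == 0
    sufficient = counts[(False, True)] == 0
    return necessary, sufficient


# Invented responses for illustration only.
counts = tally([(True, True), (False, True), (False, False), (True, True)])
print(necessity_and_sufficiency(counts))
```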

Confusion Matrices. Confusion matrices, similar to those
employed by Miller and Nicely (1955), were generated for all vowels,
consonants, and monosyllables. These matrices are frequency plots
relating actual speakers to perceived speakers; hence, they not only
indicate correct responses but also yield information concerning the
pattern of errors. An attempt was made to account for the observed
confusions among speakers on the basis of rank order correlation
techniques (Kendall's tau; Siegel, 1956) relating confusions to
the parameters obtained from acoustic analyses, above. The
rationale here is that if, for a given utterance type, speaker
identification is largely coded in some acoustic parameter, then
that parameter should correlate highly with confusions among speakers.
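A rank order correlation of this kind can be computed directly from
paired observations. The sketch below implements Kendall's tau with the
usual correction for ties (tau-b); how the confusions and acoustic
parameters were actually paired in this study is not specified here, so
the pairing shown (one entry per speaker pair) and the numbers are
hypothetical.

```python
import math
from itertools import combinations


def kendall_tau_b(x, y):
    """Kendall's rank correlation between x and y, corrected for ties."""
    concordant = discordant = tied_x_only = tied_y_only = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        dx, dy = x1 - x2, y1 - y2
        if dx == 0 and dy == 0:
            continue                    # tied on both variables
        elif dx == 0:
            tied_x_only += 1
        elif dy == 0:
            tied_y_only += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    untied = concordant + discordant
    denominator = math.sqrt((untied + tied_x_only) * (untied + tied_y_only))
    if denominator == 0:
        return float("nan")
    return (concordant - discordant) / denominator


# Invented data: difference in fundamental frequency (Hz) for five
# speaker pairs, and the number of confusions between each pair.
f0_difference = [2, 5, 9, 14, 30]
confusions = [11, 9, 6, 4, 1]
print(kendall_tau_b(f0_difference, confusions))
```

In this invented case the correlation is perfectly negative: the closer
two speakers are on the parameter, the more often they are confused,
which is the pattern expected if the parameter carries identity
information.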

For the voiced vowels, confusions were related to fundamental
frequency (f0), first formant frequency (F1), second formant
frequency (F2), third formant frequency (F3), and the ratio of formant
two to formant one frequency (F2/F1). For the whispered vowels,
confusions were related to F1, F2, F3, and F2/F1. For the low-pass
filtered vowels, f0 only was employed.

For the voiced consonants, /v/ and /z/, confusions were
related to f0, F1, and F2. The confusions among speakers for the
voiceless consonants, /s/ and /f/, were related to F1, F2, and the
bandwidths (BW) of those formants, BW1 and BW2, respectively.

For the monosyllable, /va/, the following parameters were
chosen: f0, the first three formants of the vowel portion, and
the locus of the second formant transitions.

The  acoustic  parameters  employed  here  were  chosen  because 
they  not  only  represent  characteristics  which  are  both  easily  ex- 
tracted from  the  signal  and  frequently  used  in  acoustic  descrip- 
tions in  the  literature,  but  they  also,   in  some  instances,  repre- 
sent those  acoustic  parameters  thought  to  be  germane  to  speech  in- 
telligibility.    Peterson  and  Barney  (1952),  for  instance,  concluded 
that  vowels  are  differentiated  from  one  another  largely  on  the  basis 
of  the  F2/F1  ratio.     In  connection  with  the  voiceless  fricatives, 
Heinz  and  Stevens  (1961),  and  Flanagan  (1965)  have  indicated  that 
these are largely differentiated on the basis of upper formant
frequency locations; Hughes and Halle (1956) concluded that voiceless
and voiced fricatives are differentiated largely on the basis of
the presence of substantial spectral energy in the latter below
1000 Hz. Stevens and House (1956) and Harris (1958) have shown that,
in CV monosyllables, the perception of the consonant is signaled
largely by the locus of the second formant transition.

Additional insight into the relation of speaker
identification and utterance intelligibility was sought, then, by
incorporating these acoustic characteristics into the attempt to account
for the confusions among speakers in the identification task. If
identification and intelligibility are based on similar acoustic
cues, the confusions should largely be explained by those cues.


III

RESULTS AND DISCUSSION

It  will  be  recalled  that  the  sentence  stimuli  were  used  to 
obtain  some  estimate  of  the  contributions  to  speaker  identification 
of suprasegmental features, such as rate of articulation and
inflection. The vowel stimuli are directed toward the first research
problem,  concerning  the  relative  contributions  of  source  and  vocal 
tract  transfer  characteristics  to  speaker  identification.     The  con- 
sonantal stimuli  concern  the  second  research  problem  but  also  have 
implications  for  the  first  problem.     The  monosyllabic  stimuli  are 
directed  toward  evaluating  the  informational  aspects  of  duration. 

In  this  chapter,  results  are  discussed  initially  by  utterance 
type. Later subsections are concerned more specifically with the
relations  obtaining  between  acoustic  characteristics  and  identifi- 
cation, and  between  utterance  intelligibility  and  identification. 

Sentences 

The grand mean identification performance for the sentence
stimuli was 97%. Bricker and Pruzanski (1966) also used sentence
stimuli in their investigation, and their result (98%) is in good
agreement with that obtained here.

Analysis of variance, summarized in Table I, showed no
significant differences over listeners, but a significant difference
over speakers. Figure 3 illustrates the differences over speakers;
the range of scores yielded by the speakers' utterances was 87 to
100%. The range of listener performance was 90 to 100%.

TABLE I. ANALYSIS OF VARIANCE SUMMARY FOR LISTENERS'
RESPONSES TO SENTENCE STIMULI.

SOURCE                 SS      df     MS      F-ratio

Between Speakers      4.959     7    .7084    4.627*
Between Listeners     3.21     11    .2918    1.9059
Residual             11.791    77    .1531
Total                19.96     95

*p < .05

Figure 3: Listener performance for sentences produced by each speaker.
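The entries of Table I can be cross-checked from the sums of squares
and degrees of freedom alone. The sketch below is a simple arithmetic
check, with a hypothetical function name; it recomputes each mean
square as SS/df and each F-ratio as the mean square divided by the
residual mean square.

```python
def anova_summary(sources, residual):
    """Compute (MS, F) for each source from (SS, df) pairs.

    sources  -- dict mapping a source name to (sum of squares, df)
    residual -- (sum of squares, df) for the error term
    """
    ms_residual = residual[0] / residual[1]
    summary = {}
    for name, (ss, df) in sources.items():
        ms = ss / df
        summary[name] = (ms, ms / ms_residual)
    return summary


# Sums of squares and degrees of freedom as listed in Table I.
table = anova_summary(
    {"speakers": (4.959, 7), "listeners": (3.21, 11)},
    residual=(11.791, 77),
)
for name, (ms, f_ratio) in table.items():
    print(name, round(ms, 4), round(f_ratio, 3))
```

The recomputed values agree with the tabled mean squares and F-ratios
to within rounding, and the degrees of freedom are those expected for
eight speakers and twelve listeners (7, 11, and 7 x 11 = 77).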

Listener performance recorded for the sentences was by far
higher than for any other stimulus used in this study. (The next
best stimulus type in terms of identification performance was the
monosyllable /va/, for which the grand mean was 58%.) This seems
to point to the contribution of dynamic variations in articulation,
for here the listeners could avail themselves not only of such
characteristics as fundamental frequency and formant frequencies,
but also of such suprasegmental features as tempo and intonation.
A systematic investigation of the contributions to speaker
identification of such extra-phonemic features would indeed constitute a
viable problem for future research. It would, for example, be
interesting to amplitude modulate white noise against sentences,
thus preserving tempo and amplitude characteristics, and discover
whether or not speakers may be identified solely on this basis.

Vowels 

Overall listener performance for the vowel stimuli is shown
in Figure 4. The critical value shown in this figure, and in
subsequent ones, was determined by reference to a table of
cumulative binomial distributions (Staff of Computational Laboratory,
1955), and represents the minimum performance level for rejecting
the null hypothesis (no performance above chance levels) at a
significance level of .05. The overall means associated with each
type of stimulus in Figure 4 are 40.2% for the voiced vowels,
21.8% for the whispered vowels, and 20.7% for the filtered vowels.

The data in Figure 4 were derived from separate experimental
tapes, and the interactions which may have obtained among voiced,
whispered, and filtered stimuli are hence unknown. It is
nonetheless interesting to note that performance where source
characteristics alone were present (filtered vowels) and performance where vocal
tract characteristics alone were present (whispered vowels) sum
almost exactly to the performance where both are present (voiced vowels).
From this evidence, one might tentatively infer that the
contributions of source and vocal tract transfer characteristics toward
speaker identification are approximately equal and additive (an
inference which is further supported by correlation of the acoustic
characteristics of these stimuli to confusions among speakers, below).
Additionally, it is clear from these data that speaker
identification may be achieved on the basis of source characteristics
alone and on the basis of vocal tract transfer characteristics alone.
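The critical value read from the cumulative binomial tables can be
reproduced numerically. The sketch below assumes a chance level of one
in eight (a closed set of eight speakers) and, purely for illustration,
40 trials; the number of trials underlying the critical value plotted
in the figures is not stated here, and the function names are
hypothetical.

```python
from math import comb


def upper_tail(n_trials, p_chance, k):
    """P(X >= k) for X distributed Binomial(n_trials, p_chance)."""
    return sum(
        comb(n_trials, j) * p_chance**j * (1 - p_chance) ** (n_trials - j)
        for j in range(k, n_trials + 1)
    )


def critical_count(n_trials, p_chance, alpha=0.05):
    """Smallest number correct whose upper-tail probability is <= alpha."""
    for k in range(n_trials + 1):
        if upper_tail(n_trials, p_chance, k) <= alpha:
            return k
    return None


k = critical_count(40, 1 / 8)
print(k, 100.0 * k / 40)   # critical count and the same value in percent
```

Any observed score at or above the critical count would lead to
rejection of the null hypothesis of chance performance at the .05
level.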

Voiced  Vowels 

Analysis of variance procedures for main effects showed a
significant vowel-speaker interaction for the voiced vowels. Hence,
tests for simple main effects, as outlined by Kirk (1968), were
conducted, and are summarized in Table II. As indicated, there are
significant differences over speakers at three of the vowels, and
significant differences over vowels at six speakers. At each of the
latter, a posteriori comparisons among vowels were performed using
Scheffe's method (Hays, 1963). Results of this procedure, summarized
in Table III, show a general trend for low vowels, /a/ and /æ/, to
yield better identification performance than high vowels, /i/ and /u/,
and a clear trend for /u/ to yield significantly poorer
identification performance than the low vowels. This is somewhat at variance
with the conclusion of Stevens et al. (1968) that front vowels
generally result in higher identification performances than back vowels;
it should be noted, however, that these investigators did not employ
sustained, isolated vocalic utterances.

TABLE III. A POSTERIORI COMPARISONS AMONG VOICED VOWELS.

AT SPEAKER*

2          4          5          6          7          8

/æ/>/u/    /a/>/æ/    /i/>/u/    /æ/>/u/    /æ/>/u/    /æ/>/i/

/a/>/u/               /æ/>/u/               /a/>/u/    /a/>/i/

/a/>/u/


*No significant differences among
voiced vowels were found at
speakers 1 and 3.

Whispered Vowels

The results of analysis of variance procedures for the
whispered vowels are shown in Table IV. Significant differences
among speakers obtain at all vowels but /u/. For three speakers,
there are significant differences between the performances yielded
by vowels.

Table V summarizes the results of a posteriori comparisons
among whispered vowels, and indicates the same trends observed for
differences among voiced vowels; viz., that low vowels, /a/ and /æ/,
yield better identification scores than high vowels.

In connection with this general trend, an explanation for
this high vowel-low vowel difference was sought in comparing the
relative formant frequency amplitudes for the four vowels, voiced
and whispered. These measures were extracted from narrow band
spectrographic sections, and are shown in Table VI. As indicated,
the formant amplitudes for the low vowels are considerably greater
than for the high vowels, especially in the voiced condition. This
distinction is present, but not as great, for the whispered vowels;

TABLE V. A POSTERIORI COMPARISONS AMONG WHISPERED VOWELS.

AT SPEAKER*

2  3  4

/æ/>/i/  /æ/>/a/  /a/>/u/

/æ/>/u/  /a/>/i/

/a/>/æ/


*No significant differences among
vowels were found at speakers
1, 5, 6, 7, and 8.

TABLE VI. AVERAGE FORMANT FREQUENCY AMPLITUDES FOR VOICED AND
WHISPERED VOWELS (dB DOWN FROM F1 AMPLITUDE).

             VOICED                      WHISPERED
     /i/   /u/   /æ/   /a/       /i/   /u/   /æ/   /a/

F    23    30    15    1.5        7    13     6     7


Filtered Vowels

As indicated in the analysis of variance summary in Table VII,
no significant differences among vowels were found for these stimuli.
Since these stimuli do not contain formant structures, this result
supports the notion that the differences among vowels found for the
voiced and whispered stimuli are indeed due to formant amplitude
differences.

Overall,  performances  for  the  vowel  stimuli  indicate  that 
speaker  identification  is  possible  on  the  basis  of  source  information 
only  and  on  the  basis  of  vocal  tract  characteristics  only.  Addition- 
ally, there  is  some  indication  that,  for  the  voiced  vowels,  these 
contributions  are  equal  and  additive. 

Consonants 

The  grand  mean  identification  performance  for  the  continuant 
consonants  was  21.82%.     Figure  5  summarizes  overall  listener  perfor- 
mance for  each  consonant,  and  shows  that  all  consonants  yielded 
speaker  identification  performance  at  a  level  significantly  above 
chance. 
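
As a rough illustration of what "above chance" means here: with eight
speakers on the response form, a listener who guesses is correct one
time in eight (12.5%). The text does not state which test was used to
establish significance, so the sketch below is only one possible check,
an exact binomial upper-tail probability, and it assumes independent
trials. The counts in the example are hypothetical and are not data
from this study.

```python
from math import comb

N_SPEAKERS = 8               # eight speakers listed on the response form
CHANCE = 1 / N_SPEAKERS      # 0.125, i.e. 12.5% correct by pure guessing


def binomial_upper_tail(correct, trials, p=CHANCE):
    """Probability of at least `correct` hits in `trials` independent guesses."""
    if not 0 <= correct <= trials:
        raise ValueError("correct must lie between 0 and trials")
    return sum(
        comb(trials, k) * p**k * (1 - p) ** (trials - k)
        for k in range(correct, trials + 1)
    )


# Hypothetical counts for illustration only (not data from this study).
p_value = binomial_upper_tail(correct=35, trials=160)
print(f"chance = {CHANCE:.3f}, P(X >= 35 | guessing) = {p_value:.6f}")
```

A small tail probability means the observed score would rarely arise
from guessing alone; scores such as those for /f/ and /s/, which lie
only a few points above 12.5%, need many trials to clear such a test.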

Analysis of variance for the consonantal stimuli, shown in
Table VIII, indicates significant differences among speakers at the
voiced consonants, /v/ and /z/, and significant differences among
consonants for four speakers. Comparisons among consonant means at
each of these speakers are shown in Table IX. The general trend observed
is that the voiced consonants, /v/ and /z/, yield significantly
higher identification performance than their voiceless cognates, /f/
and /s/.

These  data  would  seem  to  indicate  that  source  characteristics, 
at  least  for  these  stimuli,  are  more  heavily  weighed  by  the  listener 
making  identity  judgments  than  are  resonant  characteristics  of  the 
vocal tract. It must be noted, however, that such a conclusion is an
oversimplification, for the cognates employed here are not differentiated
simply by the presence or absence of fundamental frequency.
Glottal  excitation,   in  the  case  of  /v/  and  /z/,  also  yields  vocal 
tract  resonances  not  present  in  /f/  and  /s/. 

Actually, it was rather surprising that /f/ and /s/ did yield
speaker identification performance above a chance level. The articulatory
constrictions which serve as a turbulent source for these
phonemes are located very far forward in the vocal tract, and the
anterior portion of the vocal tract which resonates to this excitation
is quite short; hence one would expect that the articulatory
and acoustic distinctions among speakers would be lost. Although
the data indicate that such distinctions are not lost, they do indicate
that differences are minimized. The performances recorded for
/f/ and /s/ (15.2% and 16.2%, respectively) were the two lowest scores
encountered in this study.

Utterance Intelligibility and Speaker Identification

The overall intelligibility levels for those utterances
where the listeners were forced to make intelligibility decisions
as well as speaker identity decisions are shown in Figure 6. As a
group, the filtered vowels were not intelligible; individually, /u/
was the only intelligible filtered vowel.

The  listeners  could  produce  only  four  types  of  responses  for 
these  stimuli:     (a)  speaker  identification  correct,  stimulus  intelli- 
gibility correct,   (b)  speaker  identification  incorrect,  stimulus 
intelligibility  correct,   (c)  speaker  identification  correct,  stimulus 
intelligibility  incorrect,  and  (d)   speaker  identification  incorrect, 
stimulus identification incorrect.

The  proportion  of  each  type  of  response  for  all  utterances 
is  shown  in  Figure  7.     Of  particular  interest  here  are  response 
type  (b)    (identification  incorrect,   intelligibility  correct)  and 
response  type  (c)   (identification  correct,  intelligibility  incorrect). 
For  the  voiced  and  whispered  vowels  and  the  continuant  consonants, 
response  type  (b)   is  by  far  the  most  typical.     This  would  indicate 
that  stimulus  intelligibility  is  not  a  sufficient  preliminary  to 
speaker  identification.     On  the  other  hand,   the  filtered  vowels, 
unintelligible  as  a  group,  yielded  speaker  identification  performances 
which  were  significantly  above  chance.     The  inference  here  is  that 
stimulus intelligibility is not a necessary preliminary to speaker
identification. 
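
The four response types (a) through (d) amount to a two-by-two
classification of each trial. A minimal sketch of how the proportions
plotted in Figure 7 could be tallied is given below; the function names
and the example trials are hypothetical and are not taken from the
study's data.

```python
from collections import Counter


def classify_response(speaker_correct, stimulus_correct):
    """Map one trial to response type (a)-(d) as defined in the text."""
    if speaker_correct and stimulus_correct:
        return "a"   # identification correct, intelligibility correct
    if not speaker_correct and stimulus_correct:
        return "b"   # identification incorrect, intelligibility correct
    if speaker_correct and not stimulus_correct:
        return "c"   # identification correct, intelligibility incorrect
    return "d"       # both incorrect


def response_proportions(trials):
    """trials: list of (speaker_correct, stimulus_correct) boolean pairs."""
    if not trials:
        raise ValueError("at least one trial is required")
    counts = Counter(classify_response(s, p) for s, p in trials)
    return {kind: counts[kind] / len(trials) for kind in "abcd"}


# Hypothetical trials for illustration only (not data from this study).
trials = [(True, True), (False, True), (False, True), (True, False)]
print(response_proportions(trials))
```

In these terms, a large share of type (b) responses bears on sufficiency
and a non-trivial share of type (c) responses bears on necessity, which
are the two comparisons the paragraph above draws.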

The  evidence  noted  here  suggests  then  that  speech  intelli- 
gibility and  speaker  identification  are  not  necessary  concomitants-- 
that is, it is possible to have one without the other. That they
are  qualitatively  different  percepts  is  also  reinforced  by  the  fact 
that  the  cues  which  have  been  traditionally  viewed  as  important  to 
the  perception  of  an  utterance  generally  do  not  correlate  highly, 
as  mentioned  below,  with  speaker  confusions  generated  by  the 
identification tasks.

In the introduction to this study, it was noted that most
models of phoneme recognition involve, as a preliminary, analysis
of the input in the time-frequency-amplitude domain. One might
ascribe the process of speaker identification entirely to such a
preliminary analysis component if and only if evidence of acoustic
invariances to speaker identification had been found. Although some
of the acoustic parameters extracted in this study did correlate
rather well with speaker identification performance, the correlations
were not high enough to characterize those parameters as
invariants to the process. It seems clear then that time-frequency-amplitude
information is also only a preliminary to speaker identification,
and that this information must undergo further analysis,
or sharpening, before decisions are reached.

The nature of this additional analysis is unknown. Further
insight into this problem may be offered by dichotic listening
tests in which the subjects' task is to identify speakers, not
utterances. In the dichotic paradigm, pairs of speech samples are
delivered simultaneously to the listener, one to the right ear and
one to the left. The presence of an ear advantage (that is, the
listener reports the stimuli presented to one ear more accurately
than the stimuli presented to the other ear) is taken to indicate
that perceptual processing is mediated chiefly at the cortex contralateral
to the ear showing the advantage. It has been shown that there
is a right ear-left hemisphere effect for categorical or coded speech
materials (as discussed by Liberman et al., 1967), for example,
consonant-vowel monosyllables; the most cogent inference drawn from
these data is that the left hemisphere is specialized for linguistic
processing (Studdert-Kennedy and Shankweiler, 1970). Kimura (1964)
has shown that a left ear-right hemisphere advantage is found for
musical passages, and hence the right hemisphere is considered to
be specialized for the perception of auditory patterns.

Darwin (1969) has shown that there exists no ear advantage
for vowels, indicating that perceptual processing for these is perhaps
occurring at a sub-cortical level.

Hence, dichotic presentations involving speaker identification
judgments might be expected to yield information concerning
the locus and nature of the processing involved in those judgments.
A right ear advantage would indicate that speaker identification
is associated with linguistic cues; a left ear advantage would
indicate that the processing is chiefly a matter of auditory pattern
analysis (and would point toward the importance of suprasegmental
features); finally, the absence of an ear advantage would indicate
that the processing is either low-level or involves a combination
of linguistic and pattern analyses.

Sample  Time  Interval 

Figure 8, a summary of listener performance for the natural
and synthetic monosyllables and equivalent durations (1250 msec)
of the isolated phonemes from which the latter was generated, defies
a simple interpretation. There is evidence of a "stairstep" function
in Figure 8. It is interesting to note, for example, that if the
overall performance for /v/ is averaged with the overall performance
for /a/, the result is very nearly the performance yielded by the
synthetic monosyllable, /v+a/. This seems to indicate that the
listeners were reaching their decisions for this stimulus on the
basis of steady state acoustic characteristics. Although they were
indeed treating /v+a/ as a two-phoneme utterance, note that the overall
performance effect was not additive, but averaged.

On the other hand, there was a significant difference between
/va/ and /v+a/, as indicated by the analysis of variance
summary  in  Table  X.     The  most  cogent  explanation  of  this  difference 
is  that  listeners  are  not  reaching  speaker  identification  judgments 
for  /va/  only  on  the  basis  of  the  target  acoustic  values  of  this 
monosyllable's  constituent  phonemes,  but  also  on  the  basis  either 
of  the  added  formant  transition  values  or  suprasegmental  features. 

In  general,  the  trend  indicated  by  these  data  would  tend  to 
support  the  notion  that  utterance  duration  is  an  important  variable 
in  speaker  identification  in  that  it  allows  listeners  to  sample 
larger  segments  of  a  speaker's  phonemic  repertoire.  Furthermore, 
this  added  information  is  based  not  on  steady  state  phonemic  cues 
but  on  some  more  integral  basis. 

Acoustic  Analyses 

Results of acoustic analyses of the utterances used in this
investigation are tabled in Appendix D. (In connection with the
latter, the fundamental frequencies extracted for the voiced vowels
are the same as those for the low-pass filtered vowels, since the
latter were generated directly from the former.)

In an attempt to discover the bases on which listeners
were arriving at speaker identification judgments, the speakers were
assigned ranks according to the order in which they would be
expected to be confused with each other had some acoustic parameter
been the basis of speaker identification. These rank orders of
expected confusions among speakers were then correlated with rank
orders of actual confusions (contained in Appendix E) among speakers.
A high rank order correlation obtained on the basis of some acoustic
parameter would indicate that the parameter was heavily weighed by
the listener in reaching his decisions.
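
The ranking step can be sketched as follows. The text does not spell
out the rule used to derive the expected order, so this sketch assumes
that, for a given target speaker, the other speakers are ranked by the
absolute difference between their value and the target's value on one
acoustic parameter (closest first), and that the actual order is taken
from the number of times each other speaker was named in place of the
target. The function names, the alphabetical tie-breaking, and all
numerical values are hypothetical.

```python
def expected_confusion_order(values, target):
    """Rank the other speakers by closeness to the target on one parameter.

    values: dict mapping speaker -> parameter value (e.g. mean F0 in Hz).
    Assumption: a smaller absolute difference predicts more confusions.
    Ties are broken alphabetically, which is an arbitrary choice.
    """
    others = [s for s in values if s != target]
    return sorted(others, key=lambda s: (abs(values[s] - values[target]), s))


def actual_confusion_order(confusions, target):
    """Rank the other speakers by how often each was named for the target.

    confusions: dict mapping target -> dict of named speaker -> count.
    """
    row = confusions[target]
    others = [s for s in row if s != target]
    return sorted(others, key=lambda s: (-row[s], s))


# Hypothetical values for four speakers (not measurements from this study).
f0 = {"S1": 108.0, "S2": 121.0, "S3": 111.0, "S4": 139.0}
confusions = {"S1": {"S1": 20, "S2": 3, "S3": 9, "S4": 1}}

print(expected_confusion_order(f0, "S1"))
print(actual_confusion_order(confusions, "S1"))
```

The two orderings for each target speaker are then converted to ranks
and compared with a rank correlation statistic.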

Vowels 

Table XI summarizes the results of this procedure for the
voiced vowels. Each cell entry in Table XI represents the degree of
correlation between the actual confusions among speakers against
each speaker (Xi) and the expected confusions among speakers predicted
by the rank of some acoustic parameter against each speaker
(Yi).

The statistic employed in Tables XI through XIV is Kendall's
tau. The actual cell entries in the tables are the numerator of
this statistic, S; Siegel (1956) notes that S has the same probability
distribution as tau, and provides a table for the probability of
obtaining any given S. The last column in Tables XI through XIV
lists the probability of obtaining the actual S values under the
null hypothesis of no association between ranks.
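
For reference, S is the number of concordant pairs minus the number of
discordant pairs across the two rankings, and tau is S divided by the
total number of pairs, N(N-1)/2. The sketch below computes both for
untied ranks; it is an illustration only, the two rankings of seven
items are hypothetical, and no tie correction is attempted.

```python
from itertools import combinations


def kendall_s(x, y):
    """Kendall's S: concordant minus discordant pairs (no tie correction)."""
    if len(x) != len(y):
        raise ValueError("rankings must have the same length")
    s = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            s += 1      # pair ordered the same way in both rankings
        elif product < 0:
            s -= 1      # pair ordered oppositely in the two rankings
    return s


def kendall_tau(x, y):
    """tau = S / (N(N-1)/2), where N is the number of ranked items."""
    n = len(x)
    if n < 2:
        raise ValueError("at least two ranked items are required")
    return kendall_s(x, y) / (n * (n - 1) / 2)


# Hypothetical ranks for seven items (not data from this study).
actual = [1, 2, 3, 4, 5, 6, 7]
expected = [2, 1, 3, 4, 6, 5, 7]
print(kendall_s(actual, expected), round(kendall_tau(actual, expected), 3))
```

Because tau is S scaled by a constant for a fixed N, tabling S and
looking up its probability is equivalent to testing tau directly.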

It should be noted at the outset that none of the tabled
probabilities is equal to or smaller than .05. Although this
does not invalidate the trends which may be drawn from these data,
it does seem to indicate that, at least for the utterances and
acoustic parameters employed in this study, there exists no set
of acoustic invariances to speaker identification.

For the voiced vowels, the data in Table XI indicate that,
in general, (1) fundamental frequency, (2) the second formant frequency
(F2), and (3) the third formant frequency (F3) are equally
good predictors of confusions among speakers. This is entirely
consistent with the speaker identification results obtained with
these stimuli, discussed above, and indicates that these parameters
are indeed the basis of speaker identification judgments for voiced
vowels.

Incidentally, it was noted in the introduction that Compton
(1963) had used only the vowel /i/ in his investigation and had
concluded  that  fundamental  frequency  was  the  basis  for  speaker 
identification  decisions.     The  data  here  bear  this  out,  but  they 
also  show  that  this  conclusion  does  not  generalize  to  vowels  as  a 
whole. 

For the whispered vowels, the data in Table XII show that
F3 is the best predictor overall for confusions among speakers.
Once again, /i/ exhibits a trend distinct from the other vowels;
for this whispered phoneme, the ratio F2/F1 is the best predictor
of speaker confusions. Overall, though, the F2/F1 ratio is a poor
predictor of confusions; as noted previously, this ratio has been
considered a primary cue to vowel intelligibility. There is an
indication here, then, that utterance intelligibility and speaker
identification are based on different cues.

The correlations obtained for the low-pass filtered vowels,
also shown in Table XII, are the highest obtained for all utterances,
demonstrating that the listeners are indeed using fundamental frequency
cues in reaching identity judgments for these stimuli. Overall, the
correlations obtained for the vowel stimuli offer more conclusive
evidence for the notion that the contributions of source and vocal
tract transfer characteristics are equal and additive. Also of
interest is the finding that cues which are thought to be crucial
to speech intelligibility are poor predictors of confusions among
speakers.

Consonants

Table XIII shows that fundamental frequency is a better
predictor of confusions than the formant structure for the voiced
continuants, /v/ and /z/. For /s/ and /f/, however, the center
frequency of the first formant accounts for more confusions than
formant bandwidths.

The correlations obtained for /v/ are uniformly lower than
those obtained for /z/, yet the fundamental frequency data for
these stimuli were very similar. This would seem to indicate that
there are other factors involved in speaker identification for these
stimuli. As noted by Hughes and Halle (1956), the acoustic characteristics
for these phonemes are quite complex, since they represent the
interactions of two sources (one, quasi-periodic, at the larynx
and the other, turbulent, at the point of constriction) and the
vocal tract resonances which they excite. That these resonances
do play a role in speaker identification may be inferred by noting
that the portion of the vocal tract anterior to the consonantal
constriction associated with /z/, an alveolar phoneme, is considerably
longer (and hence would more enhance interspeaker variations)
than that associated with /v/, a labio-dental phoneme.

This same trend between degree of correlation and place of
articulation is evident for the voiceless consonants, /s/ and /f/.

Monosyllables 

As shown in Table XIV, confusions among speakers for the
monosyllable, /va/, are highly predictable from fundamental frequency.
Note, however, that F2, F3, and the locus of the F2 transition
also correlate reasonably well with obtained confusions. Fundamental
frequency and the F2 locus represent here characteristics of the
entire utterance, and are not characteristics proper of its constituent
phonemes. This reinforces the notion, discussed above,
that multi-phonemic utterances yield higher speaker identification
scores not because of target acoustic values but rather on the
basis of phonemic interactions. These data are not adequate for
firmly establishing the nature of such interactions, but they
strongly suggest their existence.

Differences  Among  Listeners  and  Speakers 

The  analysis  of  variance  procedures  detailed  above  indicated 
significant  differences  among  listeners  for  the  voiced  and  whispered 
vowels, consonants, and monosyllables. For the vowels, speaker
identification performance by listener, pooled over speakers, is
shown in Figure 9. Listener performances for the consonant and
monosyllabic stimuli are shown in Figure 10 and Figure 11, respectively.

The general trend evidenced in these representations is
that listeners 1, 2, 3, 5, and 6 perform consistently well, while
the performances of listeners 4, 7, and 11 appear to be consistently
depressed.

In regard to these trends, it is interesting to note that
each of the listeners whose performances were consistently high
is better acquainted with the speakers, as a group, than are
listeners 4, 7, and 11. Additionally, the latter are the only
listeners who are not professionals or advanced students in the
field of speech and hearing science, a group which has had considerable
experience in serving as subjects in behavioral experiments
(it should be noted, however, that listener 4 does have
such experience).

The important feature in the listeners' performance is consistency.
This would seem to indicate that the differences encountered
among listeners are not due to such transient effects as
fatigue or attention, but rather to experience both with the speakers
and with stimulus-response paradigms.

Significant differences among speakers were found for the
voiced and whispered vowels, voiced consonants, and the monosyllables.
Identification performance by speaker for the vowel stimuli,
pooled over listeners, is shown in Figure 12. Overall, the
additivity effect of source and vocal tract transfer characteristics
holds here. There is some evidence, however, that if a
speaker's utterances showed some acoustic characteristic which
was quite distinct from the others in the group, then the listeners
tended to weigh that parameter more heavily when making identity
judgments for that speaker. Note, for instance, that the performances
yielded by speakers 2 and 5 tend to be higher, for voiced and filtered
vowels, than those of the other speakers. Acoustic analyses revealed
that the mean fundamental frequencies for speakers 2 and 5 (126 Hz
and 144 Hz, respectively) are higher than those for the rest of the
group.

When speaker 5's distinctive source information is absent,
as in the whispered vowels, the performance yielded by his utterances
then deteriorates dramatically. The formant frequencies for
each speaker's whispered vowels are plotted in Figure 13 and Figure
14. Note that the formant frequency values for the utterances of
speakers 5 and 8, which were identified at a level below chance,
tend to be unexceptional (but for F^ for speaker 5's utterance of
/i/). On the other hand, the formant frequencies for speaker 2's
whispered vowels, which yielded the highest identification scores
among the group, tend to be exceptional; note the low F3 for /u/,
the high F3 for /æ/, and the very high F^ for /a/.

Figure 13: Formant frequency values by speaker for whispered /i/ and /u/.

Figure 14: Formant frequency values by speaker for whispered /æ/ and /a/.

Performance by speaker for the consonants, pooled over
listeners, is shown in Figure 15. The performances, for the
voiced consonants, yielded by speakers 1, 2, 5 and 7 tend to be
higher than those of other speakers. Fundamental frequency measures
tend to offer an explanation for these differences: the fundamental
frequencies for the voiced consonant utterances of speakers 1, 2, and
5 (137 Hz, 130 Hz, and 144 Hz, respectively) are the three highest
measures for the group, while speaker 7's utterances show the lowest
such measure (106.5 Hz) for the group.

For the monosyllabic stimuli, performance by speaker is
shown in Figure 16. Differences among speakers for the 'synthetic'
monosyllable tend to show the same pattern exhibited for voiced
consonants (Figure 15) and the low-passed vowels (Figure 12), indicating
that the same acoustic cue (i.e., fundamental frequency)
is being used for /v+a/ as for these stimuli. For the natural monosyllable,
however, no explanation for the differences among speakers is
offered by the distribution of the acoustic parameters
which were extracted in this study. It may well be that speaker
identification performance for this stimulus is determined by a
suprasegmental feature such as inflection.

Overall,  the  trends  exhibited  in  the  differences  among  speakers, 
although  they  do  not  apply  universally,  confirm  that  the  acoustic 
parameters  which  were  found  to  correlate  with  speaker  confusions 
are indeed the basis for listener judgments. There is also an
indication  that  listeners  more  heavily  weigh  a  given  acoustic 
correlate  if,  for  a  given  speaker,  it  stands  in  distinction  from 
the  general  speaker  group. 


IV 

SUMMARY  AND  CONCLUSIONS 

An  investigation  was  undertaken  concerning  the  ability  of 
subjects  to  identify  speakers  solely  on  the  basis  of  voice.  The 
purposes of this study were: (1) to establish the relative contributions
of source and vocal tract transfer characteristics to speaker
identification,   (2)  to  establish  whether  or  not  speakers  could  be 
identified  on  the  basis  of  isolated  utterances  of  continuant  con- 
sonants,  (3)  to  investigate  the  nature  of  the  relation  between 
utterance  intelligibility  and  speaker  identification,  and  (4)  to 
determine  whether  sample  duration  was  a  variable  in  speaker  identi- 
fication in  absolute  or  relative  terms. 

The subjects for this study consisted of eight male speakers
and twelve listeners; the latter had been in routine contact with the
former for a period of at least six months. The following speaker
utterances, equated for intensity, were presented to the listeners:
two prose sentences; four vowels (/i, u, æ, a/) under three conditions,
voiced, whispered, and low-pass filtered at 200 Hz; four
consonants (/s, f, v, z/); two monosyllables, one natural (/va/)
and one generated by abutting two steady state phonemic excerpts
(/v+a/).

The three vowel conditions were taken to simulate the
presence only of (1) source information (filtered vowels),
(2) vocal tract transfer information (whispered vowels), or (3)
both (voiced vowels). Except for the sentences, all stimuli were
presented at a duration of 1250 msec. The inclusion of the monosyllables
allowed for the evaluation of the contributions of the
informational aspects of duration; if the latter was a variable
only in absolute terms, no differences in speaker identification
performance for single phoneme vs two phoneme utterances would have
been obtained.

The listeners were presented with forms listing each speaker
by initials, and their task was to circle the speaker they felt
produced each item. The listeners were also required to choose
which stimulus item was presented for all the vowel and consonant
stimuli employed in the study.

Acoustic  analyses  of  the  speakers'  utterances  were  performed 
and  the  following  parameters  were  extracted:     fundamental  frequency, 
the  first  three  formant  frequencies,  the  ratio  of  the  second  to  the 
first  formant  frequency,  formant  bandwidths  (for  the  voiceless 
consonants)  and  formant  amplitudes  (for  the  voiced  and  whispered 
vowels).     The  confusions  among  speakers  predicted  by  each  of  these 
parameters were correlated with the actual confusions among speakers
in an attempt to ascertain which acoustic characteristics serve as
important  cues  to  speaker  identification. 

The results of this study may be summarized as follows:

1. All stimuli yielded speaker identification performance
at a level significantly above chance.

2. The sentence stimuli resulted in performance far above
any other stimulus type.

3. The performances achieved for whispered vowels and
filtered vowels were very nearly equal and summed to the performance
achieved for voiced vowels. The correlations between acoustic
characteristics and confusions among speakers revealed that fundamental
frequency, the second formant, and the third formant were
equally good predictors of speaker confusions. There was a general
trend for low vowels to yield higher performances than high vowels.

4. The voiced continuant consonants yielded significantly
higher  performances  than  their  voiceless  counterparts.  Fundamental 
frequency  was  the  best  predictor  of  speaker  confusions  for  the 
voiced  consonants;  for  the  voiceless  consonants,  the  first  formant 
frequency  was  the  best  such  predictor  obtained,  though  the  correla- 
tion was  weak  in  absolute  terms. 

5.  In  regard  to  duration,  /va/  yielded  significantly  better 
results  than  /v+a/,  and  also  resulted  in  performances  above  equivalent 
durations  of  /v/  and  /a/.     Also,   if  the  performance  for  /v/  and  for 
/a/  were  averaged,  the  result  was  very  nearly  the  performance  yielded 
by  /v+a/. 

6.  Differences  among  listeners  were  accounted  for  in  terms 
of  their  relative  familiarity  both  with  speakers  and  with  behavioral 
paradigms.     The  trends  present  in  the  differences  among  speakers 
were  largely  accounted  for  in  terms  of  the  acoustic  parameters  of 
their  utterances. 

7.  Little  correspondence  was  found  between  the  cues  important 
for  speech  intelligibility  and  those  thought  to  be  important  for 
speaker  identification.     Utterance  intelligibility  was  found  to  be 
neither a necessary nor sufficient concomitant to speaker identification.

The major conclusions provided by this investigation are
that, although one can point to acoustic correlates of speaker identification,
there seem to be no acoustic invariants related to speaker
identification; furthermore, speech intelligibility and speaker identification
seem to be qualitatively different percepts. This would
indicate that an adequate model for phoneme identification would
not necessarily serve as an adequate model for speaker identification
and vice versa. Further research into the nature and locus of
speaker identification processing is strongly recommended, and a
dichotic listening paradigm may prove particularly fruitful.

Other  and  more  specific  conclusions  also  seem  warranted. 
First,   speaker  identification  for  vowels  appears  to  be  based  on  both 
fundamental  frequency  and  formant  frequency  information;  the  in- 
fluence of  these  parameters  is  both  equal  and  additive.   The  general 
trend  for  low  vowels  to  yield  higher  performance  than  high  vowels 
may  be  accounted  for  by  systematic  differences  in  the  formant 
amplitudes  of  these  vowels. 

Secondly,  speaker  identification  is  possible  on  the  basis 
of  isolated  continuant  consonants.     The  level  of  performance  achieved 
for  these  stimuli,  although  above  chance,  was    the  lowest  encountered 
in  this  study.     Although  identification  of  the  voiced  consonants 
correlates  well  with  fundamental  frequency,  additional  research 
into  the  nature  of  the  acoustic  cues  which  allow  identification 
of  the  voiceless  consonants  is  needed. 

Thirdly, the sample time interval contributes to speaker
identification in a relative sense only, i.e., what is important
is not the absolute duration of this interval, but the nature of
the utterance contained in the interval. Specifically, this study
demonstrates that, for a given duration, multiphonemic utterances
yield better speaker identification performances than single phoneme
samples; further, this added information is based on some integral
measure of a multiphonemic utterance and not on the target values of
its constituent phonemes.

Finally,  the  very  high  performance  yielded  by  the  sentence 
stimuli  points  to  the  possible  importance  of  suprasegmental  cues 
such  as  tempo  and  inflection  to  speaker  identification. 

APPENDIX A
INSTRUCTIONS TO EVALUATORS
OF SPEAKERS' UTTERANCES

APPENDIX B
INSTRUCTIONS  TO  LISTENERS 


INSTRUCTIONS 


This  is  an  experiment  in  speaker  identification.     You  will 
be  listening  to  various  speech  samples,  each  produced  by  one  of 
the  eight  speakers  pictured  on  the  wall  in  front  of  you.     Your  task 
is to listen to each sample and then decide which speaker produced it;
in some cases, you will also decide on which phoneme was produced.

Indicate  your  decision  by  circling  the  appropriate  speaker 
and  phoneme.     Please  respond  to  all  items  (if  you  are  not  sure, 
guess).     If,  after  circling  an  item,  you  change  your  mind,  cross 
out  the  former  decision  and  circle  the  new  one. 

There  will  be  a  four  second  interval  between  each  item. 
After  every  25th  sample,  there  will  be  a  ten  second  pause,  so 
that  you  may  turn  to  a  new  response  page.     If  you  find  that  you 
have not completed a response page when the longer pause occurs,
notify  the  experimenter  immediately. 

Please  note  that  the  pictures  of  the  speakers  and  their 
identifying  initials  appear  on  the  wall  in  the  same  left  to  right 
order  as  the  initials  on  the  response  form  (please  also  note  that 
John  Booth  and  John  Brandt  have  the  same  initials;  the  latter  is 
designated here as 'DrB' and not as 'JB').

The first series of samples you will hear will be the
utterance: "When the sunlight strikes raindrops in the air, they
act  like  a  prism  and  form  a  rainbow.     The  rainbow  is  a  division  of 
white  light  into  many  beautiful  colors." 

Before  the  start  of  subsequent  series,  the  experimenter 
will  inform  you  of  the  specific  samples  you  will  be  hearing. 

If  you  have  any  questions,  please  ask  them  now. 


APPENDIX C
EXAMPLE  OF  LISTENER  RESPONSE  FORM 


CODE:

PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v
PW  CL  RI  EM  BB  JB  DrB  SB        s  f  z  v


APPENDIX  D 


MEASURES  DERIVED  FROM  ACOUSTIC 
ANALYSES  OF  SPEAKERS'  UTTERANCES 


TABLE XV. FUNDAMENTAL FREQUENCY AND FORMANT FREQUENCY (Hz)
MEASURES FOR THE VOICED* AND WHISPERED VOWELS


                                  SPEAKER
                   1     2     3     4     5     6     7     8

VOICED

/i/      F0      101   124   113   ...   ...   114   109   130
         F1      240   280   280   280   320   280   320   320
         F2      [illegible in source; legible fragments: 2520, 2040, 2720]
         F3     2960  3160  3640  3880  3160  3040  2960  3680

/u/      F0      103   122   118   107   149   115   112   128
         F1      200   280   280   280   280   320   320   320
         F2      960  1040   960   760  1120  1000   880  1000
         F3     2360  2320  2560  1640  2800  2560  3080  2600

/æ/      F0       95   132   118   105   142   113   109   117
         F1      840   720   760   760   800   800   680   800
         F2     1640  1680  1880  1960  1720  1840  1520  1680
         F3     2560  2440  2520  2600  2400  2560  2680  2440

/a/      F0       96   126   116   104   143   113   106   127
         F1      680   800   640   800   780   760   640   800
         F2     1140  1080  1160  1330  1200  1160  1060  1180
         F3     2400  2640  2560  2320  2420  2600  2120  2600

WHISPERED

/i/      F1      240   320   320   400   360   320   280   320
         F2     2320  2480  2600  2680  2840  2600  2720  2720
         F3     2840  3440  3040  3240  3840  3120  3480  3520

/u/      F1      400   400   320   440   400   360   360   440
         F2     1480  1120  1040   920  1080  1080   880   960
         F3     2520  2240  2560  2520  2680  2640  2600  3040

/æ/      F1      840   840   880   800   800   880   760  1000
         F2     1840  1680  2000  2400  1920  2160  1760  1920
         F3     2720  2400  2640  2920  2520  2760  2520  2800

/a/      F1      760   960   880   920   800   ...   840   960
         F2     1240  2520  1280  1480  1200  1240  1280  1560
         F3     2560  3080  2720  2440  2600  2720  2640  2680

*Fundamental frequencies for the filtered vowels are the same as
those reported for the voiced vowels.


TABLE XVI. FUNDAMENTAL FREQUENCY, FORMANT FREQUENCY, AND
FORMANT BANDWIDTHS (Hz) FOR THE CONSONANTS


                                  SPEAKER
                   1     2     3     4     5     6     7     8

/v/      F0      148   132   116    98   144   112   108   114
         F1      280   320   440   280   280   320   ...   280
   [label lost]  ...   720  1120   920  1000  1280  1200  1000

/z/      F0      125   128   117   103   143   115   111   ...
         F1      280   320   360   400   320   ...   ...   ...
   [label lost]  760   ...   ...   800   ...   ...   ...   ...

Remaining entries (placement not recoverable): 960, 640, 3120, 1680,
560, 360, 920, 2840, 800, 2440, 2720, 1120, 2840


TABLE XVII. FUNDAMENTAL FREQUENCY AND FORMANT FREQUENCY
(Hz) MEASURES FOR THE MONOSYLLABLE, /va/


SPEAKER 


APPENDIX E
CONFUSIONS AMONG SPEAKERS

Figure 17: Confusions among speakers for voiced /i/.

Figure 18: Confusions among speakers for voiced /u/.




                              Actual Speaker
                     1    2    3    4    5    6    7    8

Identified    1     25    3    4    7    8    3    4    5
Speaker       2      1   50    1    6    2    1   15   12
              3     12    1   20   10    1   12    0    3
              4      6    0    1    5    2    0    4    5
              5      0    2    4    0   45    6    1    0
              6     10    0   29   10    1   33    0    3
              7      2    4    0    8    0    3   34    9
              8      4    0    1   14    1    2    2   23

Figure 19: Confusions among speakers for voiced /æ/.
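
A confusion matrix of this kind can be reduced to percent-correct scores for comparison with the chance level for eight speakers. The following Python sketch is an illustration only and is not part of the original analysis. It assumes that the counts printed above the Figure 19 caption belong to that figure, that rows are identified speakers and columns are actual speakers, and that the diagonal holds the correct identifications. The function and variable names are hypothetical.

```python
# Counts transcribed from the matrix printed with Figure 19 (assumption:
# rows = identified speaker 1-8, columns = actual speaker 1-8).
CONFUSIONS = [
    [25, 3, 4, 7, 8, 3, 4, 5],
    [1, 50, 1, 6, 2, 1, 15, 12],
    [12, 1, 20, 10, 1, 12, 0, 3],
    [6, 0, 1, 5, 2, 0, 4, 5],
    [0, 2, 4, 0, 45, 6, 1, 0],
    [10, 0, 29, 10, 1, 33, 0, 3],
    [2, 4, 0, 8, 0, 3, 34, 9],
    [4, 0, 1, 14, 1, 2, 2, 23],
]


def percent_correct_by_speaker(matrix):
    """Percent of each actual speaker's samples that were correctly identified."""
    n = len(matrix)
    scores = []
    for actual in range(n):
        # Sum down the column: all responses given to this actual speaker.
        presented = sum(matrix[identified][actual] for identified in range(n))
        scores.append(100.0 * matrix[actual][actual] / presented)
    return scores


def overall_percent_correct(matrix):
    """Percent of all responses that fall on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return 100.0 * correct / total


# With eight equally likely speakers, guessing yields 1 correct in 8.
CHANCE_PERCENT = 100.0 / len(CONFUSIONS)

if __name__ == "__main__":
    for speaker, score in enumerate(percent_correct_by_speaker(CONFUSIONS), start=1):
        print(f"Speaker {speaker}: {score:.1f}% correct")
    print(f"Overall: {overall_percent_correct(CONFUSIONS):.1f}% "
          f"(chance = {CHANCE_PERCENT:.1f}%)")
```

Under these assumptions each column sums to 60 responses, and 235 of the 480 responses lie on the diagonal, which is roughly 49 percent against a chance level of 12.5 percent.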

Figure 20: Confusions among speakers for voiced /a/.

Figure 21: Confusions among speakers for whispered /i/.


Actual                      Identified Speaker
Speaker         1    2    3    4    5    6    7    8

   1           15    5    5    5    2    4   12   11
   2            7   17   11    4    5    7    4    5
   3           11    4    7    3    1    8    9   17
   4            9    6    5   15    5    4   10    6
   5            7    2    7   10    8   14   (two entries missing; positions uncertain)
   6            2    0   14    3    7   15    8   11
   7           13    2   15    7    7    5    7    4
   8            3    3    7    5    9   13   11    9

Figure 22: Confusions among speakers for whispered /u/.




Actual                      Identified Speaker
Speaker         1    2    3    4    5    6    7    8

   1           15    2   12    6    6   14    1    4
   2           10   41    0    3    1    0    1    4
   3           11    7    6    4    8   12    5    7
   4           11    2    5    6   18   10    5    3
   5           14   16    6    9    1    3    4    7
   6            8    5    3    3   13   17    4    7
   7           14    5    6    3    7   15    2   (one entry missing; position uncertain)
   8           12    4   18    7    6   12    1    0

Figure 23: Confusions among speakers for whispered /æ/.

Figure 24: Confusions among speakers for whispered /a/.

Figure 25: Confusions among speakers for filtered /i/.

Figure 26: Confusions among speakers for filtered /u/.

Figure 27: Confusions among speakers for filtered /æ/.

Figure 28: Confusions among speakers for filtered /a/.

Figure 29: Confusions among speakers for /s/.

Figure 30: Confusions among speakers for /z/.

Figure 31: Confusions among speakers for /f/.

Figure 32: Confusions among speakers for /v/.

Figure 33: Confusions among speakers for /v+a/.

Figure 34: Confusions among speakers for /va/.